Running COMETS Analytics locally

Ewy Mathé, Ella Temprosa

2020-12-02

Introduction

COMETS Analytics support all cohort-specific analyses of the COMETS consortium. This collaborative work is done via the COMETS harmonization group activities. For more information, see the [COMETS website] (http://epi.grants.cancer.gov/comets/).

Data Input Format

The required input file shoud be in excel format with the following 5 sheets:

  1. Metabolites - from harmonized metabolites output
  2. SubjectMetabolites - abundances in columns and subject in rows
  3. SubjectData - other exposure and adjustment variables
  4. VarMap - maps the variables needed to conduct the cohort specific analysis. Specify the name of variables under CohortVariable column. if the VarReference has the same name in the cohort, leave blank
  5. Models - models used to conduct COMETS analysis. Outcomes, exposures and adjustment can specify multiple covariates delimited by spaces (ie: age bmi).
  6. Options - optional sheet containing model specific options

An example input file is available HERE.

Analysis Workflow for correlation analysis

1. Load Data

The first step of the analysis is to load in the data with the readCOMETSinput() function. Input for this function is an Excel spreadsheet, per the description above.

# Retrieve the full path of the input data
dir <- system.file("extdata", package="COMETS", mustWork=TRUE)
csvfile <- file.path(dir, "cometsInputAge.xlsx")

# Read in and process the input data
exmetabdata <- COMETS::readCOMETSinput(csvfile)
## Registered S3 method overwritten by 'seriation':
##   method         from 
##   reorder.hclust gclus
## [1] "Metabolites sheet is read in"
## [1] "SubjectMetabolites sheet is read in"
## [1] "SubjectData sheet is read in"
## [1] "VarMap sheet is read in"
## [1] "Models sheet is read in"
## [1] "There are 14 categorical variables"
## [1] "Running Integrity Check..."
## Joining, by = "id"
## Joining, by = "hmdb_id"
## [1] "Input data has passed QC (metabolite and sample names match in all input files)"

To plot some the distribution of variances for each metabolite:

COMETS::plotVar(exmetabdata,titlesize=12)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.

To plot the distribution of minimum/missing values:

COMETS::plotMinvalues(exmetabdata,titlesize=12)
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.

2. Get Model Data

There are 2 ways to specify your model, batch or interactive. In Batch mode, models are specified in your input file. The model information needs to be read in with the function getModelData() and processed so the software knows which models to run. Input for this function is the data input in the previous step:

exmodeldata <- COMETS::getModelData(exmetabdata,modlabel="1 Gender adjusted")

In Interactive mode, models are specified as parameters. The model information needs to be read in with the function getModelData() and processed so the software knows which models to run. Input for this function is the data input in the previous step:

exmodeldata <- COMETS::getModelData(exmetabdata, modelspec="Interactive",
    colvars=c("age","bmi_grp"), where=c("age>40","bmi_grp>2"))
## [1] "Analysis will run on 'All metabolites'"
## [1] "Filtering subjects according to the rule(s) age>40 & bmi_grp>2. 279 of 1000 are retained"

3. Run Simple Correlation Analysis

The unstratified correlation analysis is run by calling the function runModel(). This function runs the model(s) that is(are) defined in the input data (Models tab).

excorrdata  <- COMETS::runModel(exmodeldata,exmetabdata,"DPP")

The output of the correlation analysis can then be compiled and output to an Excel file with the following function:

COMETS::OutputListToExcel(filename="DPP_corr.xlsx", excorrdata)

To view the first 3 lines of the correlation analysis output, simply type:

COMETS::showCorr(excorrdata,nlines=3)
##   run                   outcomespec exposurespec term        corr     p.value
## 1   1 _1_2_3_benzenetriol_sulfate_2          age  age 0.164624501 0.005846722
## 2   2      _1_2_dipalmitoylglycerol          age  age 0.068903188 0.251337451
## 3   3              _1_2_propanediol          age  age 0.001667259 0.977882521

To display the heatmap of the resulting correlation matrix, use the showheatmap function.

COMETS::showHeatmap(excorrdata)




To display the hierarchical clustering of the resulting correlation matrix, use the showHClust function. This diplay requires at least 2 rows and 2 columns in the correlation matrix.

exmodeldata<-COMETS::getModelData(exmetabdata,modelspec = "Interactive",colvars = c("bmi_grp","age"))
## [1] "Analysis will run on 'All metabolites'"
excorrdata  <- COMETS::runModel(exmodeldata,exmetabdata,"DPP")
COMETS::showHClust(excorrdata)

Results can be written to an output Excel file with the following command:

COMETS::OutputListToExcel("Model1.xlsx", excorrdata)

4. Run Stratified Correlation Analysis

The stratified correlation analysis is run by calling the function runModel(). This function runs the model(s) that is(are) defined in the input data (Models tab) or in interactive mode. In this example, exmodeldata includes an object scovs that specifies the stratification variable.

  exmodeldata2 <- COMETS::getModelData(exmetabdata,modelspec="Interactive",rowvars=c("lactose","lactate"),
    colvars=c("age","bmi_grp"),strvars="race_grp")
  excorrdata2  <- COMETS::runModel(exmodeldata2,exmetabdata,"DPP")

5. Linear regression

Run a linear regression using the lm function adjusting for age group with age and bmi group as the exposures.

  exmodeldata <- COMETS::getModelData(exmetabdata,modelspec="Interactive", adjvars="age_grp",
                   rowvars=c("lactose","lactate"), colvars=c("age","bmi_grp"))
  lm_results  <- COMETS::runModel(exmodeldata, exmetabdata, "DPP", op=list(model="lm"))
  print(lm_results)
## $ModelSummary
##   run cohort        spec model outcomespec exposurespec wald.pvalue  r.squared
## 1   1    DPP Interactive           lactose          age  0.16451267 0.01467966
## 2   2    DPP Interactive           lactate          age  0.60149903 0.00165591
## 3   3    DPP Interactive           lactose      bmi_grp  0.02387982 0.02206445
## 4   4    DPP Interactive           lactate      bmi_grp  0.00166282 0.01641147
##   adj.r.squared     sigma     loglik       aic       bic   deviance df.residual
## 1    0.01171183 1.8029707 -2006.3702 4022.7404 4047.2792 3237.70036         996
## 2   -0.00135115 0.2882131  -172.8794  355.7588  380.2975   82.73452         996
## 3    0.01714526 1.7980076 -2002.6087 4019.2173 4053.5716 3213.43441         994
## 4    0.01146384 0.2863629  -165.4342  344.8684  379.2227   81.51171         994
##   nobs message             adjvars adjvars.removed adjspec outcome_uid
## 1 1000         age_grp.2;age_grp.3                 age_grp   HMDB00186
## 2 1000         age_grp.2;age_grp.3                 age_grp   HMDB00190
## 3 1000         age_grp.2;age_grp.3                 age_grp   HMDB00186
## 4 1000         age_grp.2;age_grp.3                 age_grp   HMDB00190
##         outcome exposure_uid             adj_uid
## 1 Alpha-Lactose          age age_grp.2;age_grp.3
## 2 L-Lactic acid          age age_grp.2;age_grp.3
## 3 Alpha-Lactose      bmi_grp age_grp.2;age_grp.3
## 4 L-Lactic acid      bmi_grp age_grp.2;age_grp.3
## 
## $Effects
##   run outcomespec exposurespec      term     estimate   std.error  statistic
## 1   1     lactose          age       age  0.034697534 0.024961296  1.3900534
## 2   2     lactate          age       age -0.002083854 0.003990177 -0.5222460
## 3   3     lactose      bmi_grp bmi_grp.2 -0.047905160 0.134163189 -0.3570663
## 4   3     lactose      bmi_grp bmi_grp.3  0.360610471 0.145954081  2.4707118
## 5   3     lactose      bmi_grp bmi_grp.4  0.420224133 0.577141138  0.7281133
## 6   4     lactate      bmi_grp bmi_grp.2  0.034207637 0.021367743  1.6009008
## 7   4     lactate      bmi_grp bmi_grp.3  0.080743228 0.023245640  3.4734783
## 8   4     lactate      bmi_grp bmi_grp.4 -0.126241151 0.091919426 -1.3733892
##        p.value
## 1 0.1648232433
## 2 0.6016151626
## 3 0.7211179261
## 4 0.0136512756
## 5 0.4667157218
## 6 0.1097166272
## 7 0.0005359045
## 8 0.1699410577
## 
## attr(,"ptime")
## [1] "Processing time: 0.16 sec"

6. Linear regression with glm

Run a linear regression using the glm function for the same variables as above. The Effects data frame will be the same as with lm, but the ModelSummary data frame will contain some different columns.

  glm_results  <- COMETS::runModel(exmodeldata, exmetabdata, "DPP", op=list(model="glm"))
  print(all.equal(lm_results$Effects, glm_results$Effects))
## [1] TRUE

7. Logistic regression with glm

Run a logistic regression using the glm function with the binary variable nested_case as the outcome. Here we must specify model.options in the op list to ensure the correct family is used.

  exmodeldata <- COMETS::getModelData(exmetabdata,modelspec="Interactive", adjvars="age_grp",
                   rowvars="nested_case", colvars=c("lactose","lactate"))
  op <- list(model="glm", model.options=list(family="binomial"))
  glm_results  <- COMETS::runModel(exmodeldata, exmetabdata, "DPP", op=op)

8. Run Analysis on all models defined in the input Excell sheet (‘super-batch’ mode)

All models desginated in the input file can be run with one command, and individual output Excel files or correlation results will be written in the current directory by default. The function returns a list of objects.

 allresults <- COMETS::runAllModels(exmetabdata,writeTofile=F)
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_1.4-1     ellipsis_0.3.1       class_7.3-17        
##   [4] rprojroot_1.3-2      corpcor_1.6.9        fs_1.4.2            
##   [7] rstudioapi_0.11      farver_2.0.3         remotes_2.2.0       
##  [10] subselect_0.15.2     prodlim_2019.11.13   fansi_0.4.1         
##  [13] lubridate_1.7.9      codetools_0.2-16     splines_4.0.2       
##  [16] mnormt_2.0.1         knitr_1.29           pkgload_1.1.0       
##  [19] jsonlite_1.7.0       pROC_1.16.2          caret_6.0-86        
##  [22] broom_0.7.0          cluster_2.1.0        compiler_4.0.2      
##  [25] httr_1.4.1           backports_1.1.7      assertthat_0.2.1    
##  [28] Matrix_1.2-18        lazyeval_0.2.2       cli_2.0.2           
##  [31] htmltools_0.5.0      prettyunits_1.1.1    tools_4.0.2         
##  [34] gtable_0.3.0         glue_1.4.1           reshape2_1.4.4      
##  [37] dplyr_1.0.0          Rcpp_1.0.5           cellranger_1.1.0    
##  [40] vctrs_0.3.1          gdata_2.18.0         nlme_3.1-148        
##  [43] iterators_1.0.12     crosstalk_1.1.0.1    psych_1.9.12.31     
##  [46] timeDate_3043.102    gower_0.2.2          xfun_0.19           
##  [49] stringr_1.4.0        ps_1.3.3             testthat_2.3.2      
##  [52] lifecycle_0.2.0      gtools_3.8.2         devtools_2.3.1      
##  [55] dendextend_1.14.0    MASS_7.3-51.6        scales_1.1.1        
##  [58] ipred_0.9-9          TSP_1.1-10           parallel_4.0.2      
##  [61] RColorBrewer_1.1-2   yaml_2.2.1           memoise_1.1.0       
##  [64] heatmaply_1.1.1      gridExtra_2.3        ggplot2_3.3.2       
##  [67] rpart_4.1-15         stringi_1.4.6        gclus_1.3.2         
##  [70] desc_1.2.0           foreach_1.5.0        seriation_1.2-8     
##  [73] caTools_1.18.0       pkgbuild_1.0.8       lava_1.6.7          
##  [76] rlang_0.4.6          pkgconfig_2.0.3      bitops_1.0-6        
##  [79] evaluate_0.14        lattice_0.20-41      purrr_0.3.4         
##  [82] labeling_0.3         recipes_0.1.13       htmlwidgets_1.5.1   
##  [85] processx_3.4.3       tidyselect_1.1.0     plyr_1.8.6          
##  [88] magrittr_1.5         R6_2.4.1             gplots_3.0.4        
##  [91] generics_0.0.2       COMETS_1.5.0.0       pillar_1.4.6        
##  [94] withr_2.2.0          survival_3.1-12      nnet_7.3-14         
##  [97] tibble_3.0.3         crayon_1.3.4         KernSmooth_2.23-17  
## [100] tmvnsim_1.0-2        plotly_4.9.2.1       rmarkdown_2.5       
## [103] viridis_0.5.1        usethis_1.6.1        grid_4.0.2          
## [106] readxl_1.3.1         data.table_1.12.8    callr_3.4.3         
## [109] ModelMetrics_1.2.2.2 digest_0.6.25        webshot_0.5.2       
## [112] tidyr_1.1.1          ISwR_2.0-8           stats4_4.0.2        
## [115] munsell_0.5.0        registry_0.5-1       viridisLite_0.3.0   
## [118] sessioninfo_1.1.1